1 Corresponding author, e-mail:
2 Code and workflow design.
3 Instituto Multidisciplinar para el Estudio del Medio ā€œRamon Margalefā€, Universidad de Alicante, Edificio Nuevos Institutos, Carretera de San Vicente del Raspeig s/n, 03690 San Vicente del Raspeig, Spain.
4 Institut Department of Environmental Systems Science, ETH Zürich. Universitätstrasse 16, 8092 Zurich, Switzerland.
5 Departamento de EcologĆ­a, Universidad de Alicante, Carretera de San Vicente del Raspeig s/n, 03690 San Vicente del Raspeig, Alicante, Spain.

1 Summary

The MOISCRUST database contains volumetric water content (VWC, m³/m³) measures taken by soil moisture sensors EC-5, Decagon Devices Inc., Pullman, USA) every 120 minutes (17th November 2006 to 31th January 2017) and 150 minutes (1st February 2017 to 16th December 2020) from three replicates in five different microsites (Stipa tussocks, Retama shrubs, and areas with low, medium and high cover of biocrust-forming lichens) located in The Aranjuez Experimental Station (central Iberian Peninsula, 40⁰02’ N, 3⁰32’W; 590 m.a.s.l). During the long time-span these sensors have been working, there have been periods when data capture has not possible due to technical issues, and as consequence, 34.63% of the database records are missing entries. This reproducible workflow describes in detail the method used to impute missing data in the MOISCRUST database.

The MOISCRUST database and this reproducible document are distributed under the license Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.

2 Reproducing this workflow

This reproducible workflow is available as an interactive Rstudio notebook in the file moiscrust.Rmd stored in this repository. It is packaged with renv to facilitate reproducibility. To run it in your computer, execute the code chunk below to prepare the session. You will need to replace reval = TRUEwithr eval = TRUE in the header of the code chunk.

3 Data loading and preparation

3.2 Loading and preparing the raw data

The raw data, stored in the file moiscrust_raw.csv was compiled by the members of Maestre Lab from the data provided by the soil moisture sensors.

date time retama5094 retama5062 retama5063 stipa5094 stipa5062 stipa5063 bs_cl5094 bs_cl5062 bs_cl5063 bs_cm5094 bs_cm5062 bs_cm5063 bs_ch5094 bs_ch5062 bs_ch5063
17/11/2006 18:00:00 0.197 NA NA 0.132 NA NA 0.232 NA NA 0.205 NA NA 0.121 NA NA
17/11/2006 20:00:00 0.195 NA NA 0.131 NA NA 0.233 NA NA 0.203 NA NA 0.117 NA NA
17/11/2006 22:00:00 0.194 NA NA 0.131 NA NA 0.234 NA NA 0.203 NA NA 0.117 NA NA
18/11/2006 0:00:00 0.194 NA NA 0.131 NA NA 0.234 NA NA 0.203 NA NA 0.116 NA NA
18/11/2006 2:00:00 0.194 NA NA 0.130 NA NA 0.234 NA NA 0.203 NA NA 0.116 NA NA
18/11/2006 4:00:00 0.194 NA NA 0.130 NA NA 0.234 NA NA 0.202 NA NA 0.116 NA NA
18/11/2006 6:00:00 0.194 NA NA 0.130 NA NA 0.234 NA NA 0.202 NA NA 0.115 NA NA
18/11/2006 8:00:00 0.194 NA NA 0.130 NA NA 0.234 NA NA 0.202 NA NA 0.115 NA NA
18/11/2006 10:00:00 0.194 NA NA 0.130 NA NA 0.234 NA NA 0.202 NA NA 0.115 NA NA
18/11/2006 12:00:00 0.194 NA NA 0.130 NA NA 0.234 NA NA 0.202 NA NA 0.116 NA NA
18/11/2006 14:00:00 0.195 NA NA 0.131 NA NA 0.234 NA NA 0.202 NA NA 0.116 NA NA
18/11/2006 16:00:00 0.195 NA NA 0.131 NA NA 0.233 NA NA 0.201 NA NA 0.115 NA NA
18/11/2006 18:00:00 0.194 NA NA 0.130 NA NA 0.233 NA NA 0.200 NA NA 0.114 NA NA
18/11/2006 20:00:00 0.193 NA NA 0.129 NA NA 0.233 NA NA 0.199 NA NA 0.112 NA NA
18/11/2006 22:00:00 0.192 NA NA 0.129 NA NA 0.234 NA NA 0.199 NA NA 0.111 NA NA
19/11/2006 0:00:00 0.192 NA NA 0.128 NA NA 0.234 NA NA 0.199 NA NA 0.111 NA NA
19/11/2006 2:00:00 0.191 NA NA 0.128 NA NA 0.234 NA NA 0.198 NA NA 0.110 NA NA
19/11/2006 4:00:00 0.190 NA NA 0.127 NA NA 0.234 NA NA 0.198 NA NA 0.109 NA NA
19/11/2006 6:00:00 0.189 NA NA 0.127 NA NA 0.234 NA NA 0.198 NA NA 0.109 NA NA
19/11/2006 8:00:00 0.188 NA NA 0.127 NA NA 0.234 NA NA 0.198 NA NA 0.109 NA NA

3.4 Reordering time and data columns and arranging the data into long format

At this point, the MOISCRUST data has 15 columns representing soils moisture measures (microsites per three replicates per microsite), and 10 columns representing time. The data is also structured in what is known as a ā€œwide formatā€. Below we reorder these columns to facilitate arranging the data into a ā€œlong formatā€.

date_time date_time_id date time year year_day month month_day week week_day sensor soil_moisture
2006-11-17 18:00:00 1 2006-11-17 18:00 2006 321 11 17 46 6 retama5094 0.197
2006-11-17 18:00:00 1 2006-11-17 18:00 2006 321 11 17 46 6 retama5062 NA
2006-11-17 18:00:00 1 2006-11-17 18:00 2006 321 11 17 46 6 retama5063 NA
2006-11-17 18:00:00 1 2006-11-17 18:00 2006 321 11 17 46 6 stipa5094 0.132
2006-11-17 18:00:00 1 2006-11-17 18:00 2006 321 11 17 46 6 stipa5062 NA
2006-11-17 18:00:00 1 2006-11-17 18:00 2006 321 11 17 46 6 stipa5063 NA
2006-11-17 18:00:00 1 2006-11-17 18:00 2006 321 11 17 46 6 bs_cl5094 0.232
2006-11-17 18:00:00 1 2006-11-17 18:00 2006 321 11 17 46 6 bs_cl5062 NA
2006-11-17 18:00:00 1 2006-11-17 18:00 2006 321 11 17 46 6 bs_cl5063 NA
2006-11-17 18:00:00 1 2006-11-17 18:00 2006 321 11 17 46 6 bs_cm5094 0.205
2006-11-17 18:00:00 1 2006-11-17 18:00 2006 321 11 17 46 6 bs_cm5062 NA
2006-11-17 18:00:00 1 2006-11-17 18:00 2006 321 11 17 46 6 bs_cm5063 NA
2006-11-17 18:00:00 1 2006-11-17 18:00 2006 321 11 17 46 6 bs_ch5094 0.121
2006-11-17 18:00:00 1 2006-11-17 18:00 2006 321 11 17 46 6 bs_ch5062 NA
2006-11-17 18:00:00 1 2006-11-17 18:00 2006 321 11 17 46 6 bs_ch5063 NA
2006-11-17 20:00:00 2 2006-11-17 20:00 2006 321 11 17 46 6 retama5094 0.195
2006-11-17 20:00:00 2 2006-11-17 20:00 2006 321 11 17 46 6 retama5062 NA
2006-11-17 20:00:00 2 2006-11-17 20:00 2006 321 11 17 46 6 retama5063 NA
2006-11-17 20:00:00 2 2006-11-17 20:00 2006 321 11 17 46 6 stipa5094 0.131
2006-11-17 20:00:00 2 2006-11-17 20:00 2006 321 11 17 46 6 stipa5062 NA

3.6 Number of NA per sensor

Due to technical constraints, there is a large number of missing data in the MOISCRUST dataset. To better understand the extent of such missing data, the code below counts the number of missing entries per sensor.

sensor microsite na_count na_count_percent
bs_ch5094 biocrust_high 7674 16.5
stipa5094 stipa 10553 22.7
bs_cl5094 biocrust_low 10987 23.6
bs_cl5063 biocrust_low 11114 23.9
bs_cl5062 biocrust_low 11188 24.1
bs_ch5062 biocrust_high 11776 25.3
bs_ch5063 biocrust_high 14125 30.4
stipa5062 stipa 15363 33.0
stipa5063 stipa 15604 33.5
bs_cm5094 biocrust_medium 16934 36.4
bs_cm5062 biocrust_medium 18743 40.3
retama5063 retama 20505 44.1
retama5094 retama 22840 49.1
retama5062 retama 22999 49.4
bs_cm5063 biocrust_medium 31225 67.1

4 Imputation of missing data

The imputation of missing data we implement here to impute missing data in MOISCRUST works by finding, for a given entry y with missing data at a time t, the sensor x with data for t that is in the same type of microsite (if possible), has the longest extent in common, and the highest correlation with the sensor to which y belongs, and estimates y with the linear model y ~ x.

4.1 Developing criteria to find candidates for gap filling

To generate the criteria to find the best possible candidate x to estimate the missing data y, we compute the common length and correlation between all pairs of sensors, and generate a column indicating whether they belong to the same microsite or not. With the values stored in these columns we compute a selection score based on the following expression:

\[S_{x} = \%vc_{x, y} + (R_{x, y}^2 * 100) + \left\{ \begin{array}{ll} 100, & \mbox{if $microsite_{x} == microsite_{y}$}\\ 0, & \mbox{otherwise} \end{array} \right. \]

Where:

  • \(y\) is the sensor with a missing value to be estimated.
  • \(x\) is the sensor to be used as candidate predictor to estimate the missing value in \(y\).
  • \(S_{x}\) is the selection score of the candidate sensor \(x\), in the range [0, 300].
  • \(\%vc_{x, y}\) is the percent of common valid cases of the sensors \(x\) and \(y\).
  • \(R_{x, y}^2\) is the Pearson’s R² of the common valid cases of the sensors \(x\) and \(y\).
  • \(microsite_{x}\) and \(microsite_{y}\) are the respective microsites of the sensors \(x\) and \(y\).

During data imputation, for each missing value, the sensor with the higher selection score is used to estimate it.

These criteria are stored in the data frame sensors_pairs. Along with the computation of the selection score, the code below also computes a linear model for each pair \(x ~ y\), and stores it in the object sensors_pairs_models. The identificators of these models are stored in the column model_id of the data frame sensors_pairs.

#combining sensors in pairs x-y
sensors_pairs <- combn(
  x = sensors,
  m = 2
) %>% 
  t() %>% 
  as.data.frame()

#adding combinations y-x so all pairs have both directions
#removing repeated pairs
#joining with moiscrust_NA to get sensor groups
#add column same_microsite to check if x and y are or not in the same sensor group
#add empty columns to store % of shared data, model's R squared, and model ID
sensors_pairs <- sensors_pairs %>% 
  rbind(
    data.frame(
      V1 = sensors_pairs$V2,
      V2 = sensors_pairs$V1
    )
  ) %>% 
  distinct(
    V1, 
    V2, 
    .keep_all = TRUE
  ) %>% 
  left_join(
    moiscrust_NA[, c("sensor", "microsite")],
    by = c("V1" = "sensor")
  ) %>% 
  left_join(
    moiscrust_NA[, c("sensor", "microsite")],
    by = c("V2" = "sensor")
  ) %>% 
  rename(
    y = V1,
    x = V2,
    y_microsite = microsite.x, #not a mistake
    x_microsite = microsite.y, #not a mistake
  ) %>% 
  mutate(
    same_microsite = ifelse(
      x_microsite == y_microsite, 
      TRUE, 
      FALSE
      ),
    sensors_shared_valid_percent = NA,
    sensors_r_squared = NA,
    model_id = row_number()
  ) 

#list to store models
sensors_pairs_models <- list()

#looping through sensors pairs to:
#fit lm model y ~ x and save it in sensors_pairs_models
#
for(i in 1:nrow(sensors_pairs)){
  
  #names of the sensors y and x
  y_i <- sensors_pairs[i, "y"]
  x_i <- sensors_pairs[i, "x"]
  
  #data of the sensor pair
  sensor_pair_i <- moiscrust[, c(y_i, x_i)]
  
  #complete cases of the sensor pair
  sensor_pair_i <- sensor_pair_i[complete.cases(sensor_pair_i), ]
   
  #common cases
  sensors_pairs[i, "sensors_shared_valid_percent"] <- 
    nrow(sensor_pair_i) / nrow(moiscrust) * 100
  
  #R squared of the sensor pair
  sensors_pairs[i, "sensors_r_squared"] <- cor(
    sensor_pair_i[, 1],
    sensor_pair_i[, 2]
    )
  
  #model formula y ~ x
  formula_i <- as.formula(paste(y_i, "~", x_i))
  
  #linear model
  sensors_pairs_models[[i]] <- lm(
    formula = formula_i,
    data = sensor_pair_i
  )
  
}

#selection score to find candidates during gap filling 
#(sensors_r_squared * 100) +
#sensors_shared_valid_percent + 
#same_microsite (TRUE = 100, FALSE = 0)
sensors_pairs <- mutate(
  sensors_pairs,
  selection_score = 
    (sensors_r_squared * 100) + 
    sensors_shared_valid_percent + 
    ifelse(same_microsite == TRUE, 100, 0)
)

#removing objects we don't need
rm(
  sensor_pair_i,
  formula_i,
  i,
  x_i,
  y_i
)

The resulting data frame, named sensors_pairs, has columns with the names of the sensor y (the one with missing data to impute), the sensor x (the candidate to be used as predictor to estimate y), their respective microsites, a column indicating if they belong to the same microsite, the percent of shared valid data, the R squared of their shared data, a

y x y_microsite x_microsite same_microsite sensors_shared_valid_percent sensors_r_squared model_id selection_score
retama5094 retama5062 retama retama TRUE 20.503945 0.8825786 1 208.76181
retama5094 retama5063 retama retama TRUE 28.736052 0.9065618 2 219.39223
retama5094 stipa5094 retama stipa FALSE 49.437792 0.6546707 3 114.90486
retama5094 stipa5062 retama stipa FALSE 34.751575 0.6570769 4 100.45926
retama5094 stipa5063 retama stipa FALSE 27.985724 0.8494829 5 112.93401
retama5094 bs_cl5094 retama biocrust_low FALSE 45.825898 0.8610431 6 131.93021
retama5094 bs_cl5062 retama biocrust_low FALSE 39.137445 0.6276224 7 101.89968
retama5094 bs_cl5063 retama biocrust_low FALSE 34.364586 0.8118848 8 115.55307
retama5094 bs_cm5094 retama biocrust_medium FALSE 44.099499 0.8808309 9 132.18258
retama5094 bs_cm5062 retama biocrust_medium FALSE 31.105282 0.7429876 10 105.40404
retama5094 bs_cm5063 retama biocrust_medium FALSE 5.892976 0.8384753 11 89.74050
retama5094 bs_ch5094 retama biocrust_high FALSE 48.414422 0.7630192 12 124.71634
retama5094 bs_ch5062 retama biocrust_high FALSE 38.305420 0.7500951 13 113.31493
retama5094 bs_ch5063 retama biocrust_high FALSE 32.943478 0.8490180 14 117.84528
retama5062 retama5063 retama retama TRUE 32.057704 0.8394976 15 216.00746
retama5062 stipa5094 retama stipa FALSE 42.633242 0.7653060 16 119.16384
retama5062 stipa5062 retama stipa FALSE 37.522843 0.5339814 17 90.92099
retama5062 stipa5063 retama stipa FALSE 31.636317 0.8561540 18 117.25171
retama5062 bs_cl5094 retama biocrust_low FALSE 43.684561 0.7455036 19 118.23492
retama5062 bs_cl5062 retama biocrust_low FALSE 48.896008 0.6940089 20 118.29689

4.2 Generating the x and y matrices to predict imputed values

During data imputation, two data frames are needed. The data frame x contains the data of every sensor for every available time, while the data frame y, which starts with empty values, is where the imputed values, their confidence intervals, selection criteria, and other quality-related columns are going to be stored.

The x data frame looks as follows:

retama5094 retama5062 retama5063 stipa5094 stipa5062 stipa5063 bs_cl5094 bs_cl5062 bs_cl5063 bs_cm5094 bs_cm5062 bs_cm5063 bs_ch5094 bs_ch5062 bs_ch5063
0.197 NA NA 0.132 NA NA 0.232 NA NA 0.205 NA NA 0.121 NA NA
0.195 NA NA 0.131 NA NA 0.233 NA NA 0.203 NA NA 0.117 NA NA
0.194 NA NA 0.131 NA NA 0.234 NA NA 0.203 NA NA 0.117 NA NA
0.194 NA NA 0.131 NA NA 0.234 NA NA 0.203 NA NA 0.116 NA NA
0.194 NA NA 0.130 NA NA 0.234 NA NA 0.203 NA NA 0.116 NA NA
0.194 NA NA 0.130 NA NA 0.234 NA NA 0.202 NA NA 0.116 NA NA
0.194 NA NA 0.130 NA NA 0.234 NA NA 0.202 NA NA 0.115 NA NA
0.194 NA NA 0.130 NA NA 0.234 NA NA 0.202 NA NA 0.115 NA NA
0.194 NA NA 0.130 NA NA 0.234 NA NA 0.202 NA NA 0.115 NA NA
0.194 NA NA 0.130 NA NA 0.234 NA NA 0.202 NA NA 0.116 NA NA
0.195 NA NA 0.131 NA NA 0.234 NA NA 0.202 NA NA 0.116 NA NA
0.195 NA NA 0.131 NA NA 0.233 NA NA 0.201 NA NA 0.115 NA NA
0.194 NA NA 0.130 NA NA 0.233 NA NA 0.200 NA NA 0.114 NA NA
0.193 NA NA 0.129 NA NA 0.233 NA NA 0.199 NA NA 0.112 NA NA
0.192 NA NA 0.129 NA NA 0.234 NA NA 0.199 NA NA 0.111 NA NA
0.192 NA NA 0.128 NA NA 0.234 NA NA 0.199 NA NA 0.111 NA NA
0.191 NA NA 0.128 NA NA 0.234 NA NA 0.198 NA NA 0.110 NA NA
0.190 NA NA 0.127 NA NA 0.234 NA NA 0.198 NA NA 0.109 NA NA
0.189 NA NA 0.127 NA NA 0.234 NA NA 0.198 NA NA 0.109 NA NA
0.188 NA NA 0.127 NA NA 0.234 NA NA 0.198 NA NA 0.109 NA NA

And this is the y data frame, that will be filled during data imputation:

interpolated model_estimate model_ci_lower model_ci_upper model_predictor same_microsite sensors_r_squared sensors_shared_valid_percent selection_score date_time_id sensor microsite
FALSE NA NA NA NA NA NA NA NA 1 NA NA
FALSE NA NA NA NA NA NA NA NA 2 NA NA
FALSE NA NA NA NA NA NA NA NA 3 NA NA
FALSE NA NA NA NA NA NA NA NA 4 NA NA
FALSE NA NA NA NA NA NA NA NA 5 NA NA
FALSE NA NA NA NA NA NA NA NA 6 NA NA
FALSE NA NA NA NA NA NA NA NA 7 NA NA
FALSE NA NA NA NA NA NA NA NA 8 NA NA
FALSE NA NA NA NA NA NA NA NA 9 NA NA
FALSE NA NA NA NA NA NA NA NA 10 NA NA
FALSE NA NA NA NA NA NA NA NA 11 NA NA
FALSE NA NA NA NA NA NA NA NA 12 NA NA
FALSE NA NA NA NA NA NA NA NA 13 NA NA
FALSE NA NA NA NA NA NA NA NA 14 NA NA
FALSE NA NA NA NA NA NA NA NA 15 NA NA
FALSE NA NA NA NA NA NA NA NA 16 NA NA
FALSE NA NA NA NA NA NA NA NA 17 NA NA
FALSE NA NA NA NA NA NA NA NA 18 NA NA
FALSE NA NA NA NA NA NA NA NA 19 NA NA
FALSE NA NA NA NA NA NA NA NA 20 NA NA

4.3 Data imputation, step by step

The steps to fill the gaps in the MOISCRUST database go as follows:

1. A given sensor name is selected: ā€œstipa5063ā€

2. The sensors pairs from the table sensors_pairs where the selected sensor is y (the sensor which values are to be imputed) are selected.

y x y_microsite x_microsite same_microsite sensors_shared_valid_percent sensors_r_squared model_id selection_score
stipa5063 bs_cl5094 stipa biocrust_low FALSE 47.42330 0.7826120 61 125.6845
stipa5063 bs_cl5062 stipa biocrust_low FALSE 52.09941 0.6922529 62 121.3247
stipa5063 bs_cl5063 stipa biocrust_low FALSE 66.13635 0.8021262 63 146.3490
stipa5063 bs_cm5094 stipa biocrust_medium FALSE 41.85712 0.8607496 64 127.9321
stipa5063 bs_cm5062 stipa biocrust_medium FALSE 42.63754 0.7499577 65 117.6333
stipa5063 bs_cm5063 stipa biocrust_medium FALSE 26.90646 0.8956584 66 116.4723
stipa5063 bs_ch5094 stipa biocrust_high FALSE 52.89274 0.7012173 67 123.0145
stipa5063 bs_ch5062 stipa biocrust_high FALSE 51.91452 0.7804496 68 129.9595
stipa5063 bs_ch5063 stipa biocrust_high FALSE 59.66504 0.7433338 69 133.9984
stipa5063 retama5094 stipa retama FALSE 27.98572 0.8494829 110 112.9340
stipa5063 retama5062 stipa retama FALSE 31.63632 0.8561540 123 117.2517
stipa5063 retama5063 stipa retama FALSE 45.96779 0.8168872 135 127.6565
stipa5063 stipa5094 stipa stipa TRUE 48.28543 0.5162195 146 199.9074
stipa5063 stipa5062 stipa stipa TRUE 48.22953 0.5211191 156 200.3414

3. The first row of the data frame x is selected.

retama5094 retama5062 retama5063 stipa5094 stipa5062 stipa5063 bs_cl5094 bs_cl5062 bs_cl5063 bs_cm5094 bs_cm5062 bs_cm5063 bs_ch5094 bs_ch5062 bs_ch5063
0.197 NA NA 0.132 NA NA 0.232 NA NA 0.205 NA NA 0.121 NA NA

3.1. If there is a valid value of soil moisture for the sensor ā€œstipa5063ā€, the algorithm goes to the next row, until there is a row with a missing value.

4. If there is an missing value (NA), the potential candidate predictors are selected from the row by removing the data of other sensors with NA, and the data of the target sensor.

retama5094 stipa5094 bs_cl5094 bs_cm5094 bs_ch5094
0.197 0.132 0.232 0.205 0.121

5. From these predictors, the one with the highest selection score is selected from the data frame sensors_pair generated in the step 2..

y x y_microsite x_microsite same_microsite sensors_shared_valid_percent sensors_r_squared model_id selection_score
stipa5063 stipa5094 stipa stipa TRUE 48.28543 0.5162195 146 199.9074

6. The model to use, stored in the list sensors_pairs_model, is selected from model_id column of the best_predictor data frame , and used to predict a value for the empty cell.

##         fit       lwr       upr
## 1 0.1435911 0.1419916 0.1451907

7. The imputed value, its confidence intervals, and other values about the imputation quality available in best_predictor are transferred to the same row in the data frame y.

8. Once all the sensors and rows have been processed this way, the matrix y is joined with moiscrust_long, and its interpolated values are transferred to the soil_moisture column, along with other columns indicating the quality of the interpolation.

4.4 Applying the algorithm to the complete dataset

The code below applies the algorithm to every sensor and row with missing data. Sensors are processed in parallel to speed up the data imputation operation.

#setup for parallel execution
if(.Platform$OS.type == "windows"){
  temp_cluster <- parallel::makeCluster(
    parallel::detectCores() - 1,
    type = "PSOCK"
  )
} else {
  temp_cluster <- parallel::makeCluster(
    parallel::detectCores() - 1,
    type = "FORK"
  )
}
doParallel::registerDoParallel(cl = temp_cluster)
    
#parallelized loop (each sensor is processed in one separated thread)
moiscrust_interpolation <- foreach::foreach(
  sensor_i = sensors,
  .packages = "tidyverse"
) %dopar% {
  
  #subset sensors_pairs
  sensors_pair_i <- sensors_pairs %>% 
    dplyr::filter(y == sensor_i)
  
  #fill microsite
  y[, "microsite"] <-sensors_pair_i$y_microsite[1]
  
  #scanning the rows of x one by one
  for(row_i in 1:nrow(x)){
    
    #if is not NA, next iteration
    if(!is.na(x[row_i, sensor_i])){next}
    
    #getting target row row
    x_row_i <- x[row_i, ]
    
    #getting predictor candidates available in x_row_i
    predictor_candidates_i <- as.vector(x_row_i)
    predictor_candidates_i <- predictor_candidates_i[which(
      !is.na(predictor_candidates_i) & 
        names(predictor_candidates_i) != sensor_i
      )]
    
    #selecting the predictor candidate with the best selection_score score
    best_predictor_i <- sensors_pair_i %>% 
      dplyr::filter(x %in% names(predictor_candidates_i)) %>%
      dplyr::arrange(desc(selection_score)) %>% 
      dplyr::slice(1)
    
    #if there is no best candidate available, next iteration
    if(nrow(best_predictor_i) == 0){next}
    
    #compute estimates with the model of the best predictor
    y[row_i, c(
      "model_estimate", 
      "model_ci_lower", 
      "model_ci_upper"
      )] <- predict(
        object = sensors_pairs_models[[best_predictor_i$model_id]],
        newdata = x_row_i,
        se.fit = TRUE,
        type = "response",
        interval = "confidence"
        )$fit
    
    #adding interpolation flag
    y[row_i, "interpolated"] <- TRUE
    y[row_i, "model_predictor"] <- best_predictor_i$x
    y[row_i, "sensors_r_squared"] <- best_predictor_i$sensors_r_squared
    y[row_i, "selection_score"] <- best_predictor_i$selection_score
    y[row_i, "sensors_shared_valid_percent"] <- best_predictor_i$sensors_shared_valid_percent
    y[row_i, "same_microsite"] <- best_predictor_i$same_microsite
    
  }
  
  #adding sensor_i name
  y[, "sensor"] <- sensor_i
  
  return(y)
  
}

#stop cluster
parallel::stopCluster(temp_cluster)

#removing loop objects
rm(
  x,
  y,
  temp_cluster
)

The imputation algorithm produces an object (a list) named moiscrust_interpolation, with one slot per sensor, each one with one y data frame containing the imputation results. Below we transform this object into the data frame moiscrust_interpolation_long and join it with moiscrust_long, to start preparing the database format.

The interpolation has removed all gaps where there was a value to interpolate from, as shown in the table below.

Sensor NA before interpolation NA after interpolation NA % before interpolation NA % after interpolation
bs_ch5062 11776 989 25.3 2.1
bs_ch5063 14125 989 30.4 2.1
bs_ch5094 7674 989 16.5 2.1
bs_cl5062 11188 989 24.1 2.1
bs_cl5063 11114 989 23.9 2.1
bs_cl5094 10987 989 23.6 2.1
bs_cm5062 18743 989 40.3 2.1
bs_cm5063 31225 989 67.1 2.1
bs_cm5094 16934 989 36.4 2.1
retama5062 22999 989 49.4 2.1
retama5063 20505 989 44.1 2.1
retama5094 22840 989 49.1 2.1
stipa5062 15363 989 33.0 2.1
stipa5063 15604 989 33.5 2.1
stipa5094 10553 989 22.7 2.1

4.5 Visualizing the interpolated time series

The MOISCRUST database looks as follows after applying the imputation algorithm.

The plot above represents both observed and interpolated values, without a clear differentiation between each type. To provide an indicator of imputation quality, the imputation algorithm also generated a new column named interpolation_quality, where the observations are marked with the flag ā€œobservationā€, the imputed data where x and y shared more than 20% of valid cases and had an R² higher than 0.85 are marked with the flag ā€œacceptableā€, and the imputed data below these thresholds are marked with the flag ā€œpoorā€. The plot below shows the values of these flags for each sensor and time, year per year.

5 Data base format

The dataset moiscrust_long is the database in long format already. Now we can rename it to moiscrust.

The database columns are:

6 Usage examples

This section shows several simple ways to work with the database in R, focusing on the minimal code required to perform basic operations. But first we have to unzip database.zip and read moiscrust.RData into the R environment.

6.1 Plotting time-series of observed data for a given microsite and year

The column interpolated contains TRUE for imputed data and FALSE for observed data, and therefore, it can be used to separate observed and imputed data right away. The column microsite groups together time-series belonging to the same microsite. Finally, the column sensor contains the data yielded by specific sensors.

6.2 Plotting the daily average of a microsite for a given set of years

To compute the daily average of a given microsite we first have to subset the data relative to such microsite for the given set of years.

To compute the mean of the three sensors of the selected microsite, the data is grouped by the columns year and year_day, and after that dplyr::summarise() is used to compute the average of soil_moisture, but removing NA values from the computation of the average.